Specify attention-23 kernel and relax assertion in prepare qkv #27217
Merged
titaiwangms merged 1 commit into main on Feb 3, 2026
Conversation
Contributor
Pull request overview
This PR improves the attention kernel selection logic for the CUDA Attention operator by explicitly setting kernel types and removing outdated code. The changes enhance code clarity and maintainability while enabling proper support for causal (unidirectional) attention with the unfused kernel path.
Changes:
- Explicitly set `kernel_type` to `AttentionKernel_Unfused` in the Attention operator to clarify kernel selection (a rough sketch follows this list)
- Remove outdated TODO comments about parameter handling and kernel selection that are now properly implemented
- Relax the assertion in `PrepareQkv_MHA_NoPast` that incorrectly prevented causal attention, replacing it with a clarifying comment about which kernels support unidirectional attention
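As a rough illustration of the first item, here is a minimal, self-contained C++ sketch. The enum, struct, and `ConfigureAttention` helper below are simplified stand-ins that only mirror names mentioned in this PR (`AttentionKernelType`, `AttentionData`, `kernel_type`); they are not the actual onnxruntime definitions.

```cpp
#include <cassert>

// Simplified stand-in for onnxruntime's attention kernel enumeration.
enum AttentionKernelType {
  AttentionKernel_Default,
  AttentionKernel_Unfused,
};

// Simplified stand-in for the per-call attention state.
struct AttentionData {
  AttentionKernelType kernel_type = AttentionKernel_Default;
  bool is_unidirectional = false;  // causal attention flag
};

// Hypothetical helper: instead of leaving kernel_type at its default (the
// pre-change behavior hinted at by the removed TODOs), the operator now
// states the unfused path explicitly.
void ConfigureAttention(AttentionData& data, bool is_causal) {
  data.is_unidirectional = is_causal;
  data.kernel_type = AttentionKernel_Unfused;  // explicit kernel selection
}

int main() {
  AttentionData data;
  ConfigureAttention(data, /*is_causal=*/true);
  assert(data.kernel_type == AttentionKernel_Unfused);
  return 0;
}
```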
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| onnxruntime/core/providers/cuda/llm/attention.cc | Removes obsolete TODOs and explicitly sets kernel_type to AttentionKernel_Unfused, making kernel selection more explicit and maintainable |
| onnxruntime/contrib_ops/cuda/bert/attention_prepare_qkv.cu | Removes overly restrictive assertion blocking causal attention and adds explanatory comment about kernel support for is_unidirectional flag |
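To make the second table row concrete, the following is a hedged sketch of the assertion pattern being relaxed. The function signature and `AttentionParameters` struct are simplified stand-ins, not the real `PrepareQkv_MHA_NoPast` implementation; only the assert-versus-comment pattern is the point.

```cpp
// Simplified stand-in for the attention parameter block.
struct AttentionParameters {
  bool is_unidirectional = false;  // causal (unidirectional) attention
};

// Sketch of the no-past QKV preparation path.
void PrepareQkv_MHA_NoPast(const AttentionParameters& params) {
  // Before: causal attention was rejected outright, e.g.
  //   assert(!params.is_unidirectional);
  // After: the assertion is removed, replaced by a comment noting that the
  // unfused kernel honors is_unidirectional, so the check belongs with
  // kernel selection rather than with QKV preparation.
  (void)params;
  // ... transpose / pack Q, K, V for the selected kernel ...
}

int main() {
  AttentionParameters params;
  params.is_unidirectional = true;  // now allowed to reach QKV preparation
  PrepareQkv_MHA_NoPast(params);
  return 0;
}
```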
tianleiwu approved these changes on Jan 31, 2026
xadupre approved these changes on Feb 2, 2026
Contributor (Author)
The Windows GPU CI error will be fixed in #27206.
This pull request updates the attention kernel selection logic and clarifies support for unidirectional (causal) attention in the CUDA attention implementation. The main changes focus on improving documentation, removing outdated comments, and explicitly setting the kernel type for better maintainability and clarity.
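As a hypothetical illustration (not onnxruntime code) of why an explicitly recorded kernel type aids maintainability and extensibility, dispatching over an explicit enum lets new kernel paths be added as switch cases rather than re-derived from scattered flags. The flash-attention entry below is purely illustrative.

```cpp
#include <cstdio>

// Simplified stand-in enum; only AttentionKernel_Unfused comes from the PR text.
enum AttentionKernelType {
  AttentionKernel_Unfused,
  AttentionKernel_FlashAttention,  // illustrative future path
};

// Hypothetical dispatcher: an explicit kernel_type makes the chosen path
// visible at the call site and easy to extend with new cases.
void RunAttention(AttentionKernelType kernel_type) {
  switch (kernel_type) {
    case AttentionKernel_Unfused:
      std::printf("unfused path: GEMM + softmax + GEMM\n");
      break;
    case AttentionKernel_FlashAttention:
      std::printf("fused flash-attention path\n");
      break;
  }
}

int main() {
  RunAttention(AttentionKernel_Unfused);
  return 0;
}
```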
Kernel selection and configuration improvements:
- Explicitly set the `kernel_type` field to `AttentionKernel_Unfused` in the `AttentionData` structure to clarify which kernel is being used and improve future extensibility.

Documentation and code clarity: